Using Policy Gradients to Account for Changes in Behaviour Policies under Off-policy Control
Abstract
Off-policy learning refers to the problem of learning the value function of one behaviour, or policy, while selecting actions with a different policy. Gradient-based off-policy learning algorithms, such as GTD (Sutton et al., 2009b) and TDC/GQ (Sutton et al., 2009a), converge when actions are selected with a fixed policy, even with function approximation and incremental updates. In control problems, however, the behaviour policy is adapted over time. One key challenge in off-policy control is that adapting the policy changes the distribution of subsequent transitions the algorithm will see. We present the first off-policy gradient-based learning algorithm that accounts for how an adjustment of the policy at the current time step affects the distribution of future transition samples. We derive the algorithm in the style of policy gradients and show that our method compares favourably with existing approaches when used for off-policy control with linear function approximation.
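For orientation, the kind of gradient-based off-policy update the abstract refers to can be sketched as follows. This is a minimal sketch of one common form of a TDC-style step with linear function approximation and an importance-sampling ratio; it is not the algorithm proposed in the paper. The function name `tdc_update`, its parameters, and the default step sizes are assumptions chosen for illustration.

```python
import numpy as np


def tdc_update(theta, w, phi, phi_next, reward, rho,
               gamma=0.9, alpha=0.1, beta=0.1):
    """One TDC-style step for a linear value estimate v(s) = theta @ phi(s).

    Illustrative assumptions (not from the paper):
      theta, w      -- main and secondary weight vectors
      phi, phi_next -- feature vectors of the current and next state
      rho           -- importance-sampling ratio pi(a|s) / b(a|s)
      alpha, beta   -- step sizes for theta and w
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)
    # Main weights: TD step plus the gradient-correction term.
    theta_new = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    # Secondary weights: track the expected TD error given the features.
    w_new = w + beta * rho * (delta - (w @ phi)) * phi
    return theta_new, w_new
```

With a fixed behaviour policy, the ratio `rho` is computed against a distribution that does not change over time. The setting the abstract addresses is the one where the behaviour policy itself is being adapted, so the distribution of sampled transitions shifts as learning proceeds.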
Publication date: 2016